Show the code
pacman::p_load(autoplotly, dplyr, DT, forecast, ggplot2, ggthemes, ggridges, ggstatsplot, ggiraph, ggfortify, hrbrthemes, imputeTS, knitr, lubridate, MLmetrics, plotly,tidyverse, tsbox, tseries, xts)March 1, 2024
March 30, 2024
autoplotly: to automatically generate interactive visualizations for statistical results supported by ‘ggfortify’, such as time series, PCA, clustering and survival analysis.
dplyr: to allow for data manipulation
DT: to provide an R interface to the JavaScript Library DataTables. R data objects can be displayed as tables on HTML pages and DataTables provides filtering, pagination, sorting and many other features in the tables.
forecast: to display and analyse univariate time series forecasts including exponential smoothing via state space models and automatic ARIMA modelling.
ggplot2: to plot charts in order to visualise our data
ggthemes: additional themes to complete ggplot2
ggridges: to plot ridgeline plots which are useful in visualising changes in distribution over time or space
ggstatsplot: To create graphics with details from statistical tests included in the plots themselves - generate information-rich plots for statistical analysis of continuous (violin plots, scatterplots, histograms, dot plots, dot-and-whisker plots) or categorical (pie and bar charts) data.
ggiraph: to create dynamic ggplot graphs
ggfortify: Unified plotting tools for statistics commonly used, such as GLM, time series, PCA families, clustering and survival analysis. The package offers a single plotting interface for these analysis results and plots in a unified style using ‘ggplot2’.
hrbrthemes: additional themes to complete ggplot2
imputeTS: to allow us to impute missing values in our time series objects
knitr: Provides a general-purpose tool for dynamic report generation in R using Literate Programming techniques.
lubridate: to manipulate and handle date-times.
MLmetrics: a collection of evaluation metrics, including loss, score and utility functions, that measure regression, classification and ranking performance.
plotly: to create interactive charts to explore our data
tidyverse: to manipulate and wrangle data
tsbox: to allow us to handle time series as plain data frames, thus making it easy to deal with time series in a dplyr or data.table workflow.
tseries: Time series analysis and computational finance.
xts: provides an extensible time series class, enabling uniform handling of many R time series classes by extending zoo, which is the package that is the creator for an S3 class of indexed totally ordered observations which includes irregular time series.
We used historical daily weather records from weather.gov.sg, and retrieved the daily records data from Jan 1980 to Dec 2023 via data.gov.sg’s API. The daily historical weather records is in csv file format.
There is no date but there are columns year, month and day. In addition, we also note that R read columns year, month and day as numeric data.
There are different columns for rainfall and temperature so we will select and filter the relevant columns that we want in subsequent steps.
The entire dataset daily_historical.csv is very large. We should save the filtered data into an R data format (RDS) so that we can easily retrieve it in future without importing the entire daily_historical.csv dataset again.
Let us first create a date column, called tdate using paste(), mutate() and lubridate’s ymd() to convert the numeric data type into a date data type and year-month-day format.
Now we will select the relevant temperature related columns needed for this exerise using select(). We will examine the resultant dataframe temp using str().
temp) is a tibble dataframe with the following columns:station: refers to the weather station that collected this temperature reading.tdate: refers to the date of the temperature reading collected.mean_temperature: refers to the mean temperature reading of that day.maximum_temperature: refers to the highest temperature reading of that day.minimum_temperature: refers to the lowest temperature reading of that day.Let us save the dataframe into an RDS file for future usage using write_rds().
We will bring in the temperature data using read_rds(). Let us check the imported RDS data using str() again.
tibble [329,156 × 5] (S3: tbl_df/tbl/data.frame)
$ station : chr [1:329156] "Macritchie Reservoir" "Macritchie Reservoir" "Macritchie Reservoir" "Macritchie Reservoir" ...
$ tdate : Date[1:329156], format: "1980-01-01" "1980-01-02" ...
$ mean_temperature : num [1:329156] NA NA NA NA NA NA NA NA NA NA ...
$ maximum_temperature: num [1:329156] NA NA NA NA NA NA NA NA NA NA ...
$ minimum_temperature: num [1:329156] NA NA NA NA NA NA NA NA NA NA ...
tdate as date data type and the temperature related columns are seen as numeric.First, let us use summary to have a sense of the missing data.
station tdate mean_temperature maximum_temperature
Length:329156 Min. :1980-01-01 Min. :22.20 Min. :22.80
Class :character 1st Qu.:1997-04-29 1st Qu.:27.10 1st Qu.:30.50
Mode :character Median :2011-09-18 Median :27.90 Median :31.70
Mean :2007-07-02 Mean :27.87 Mean :31.49
3rd Qu.:2017-11-27 3rd Qu.:28.80 3rd Qu.:32.60
Max. :2023-12-31 Max. :31.50 Max. :38.00
NA's :58 NA's :255645 NA's :255282
minimum_temperature
Min. :20.0
1st Qu.:24.3
Median :25.2
Mean :25.3
3rd Qu.:26.3
Max. :30.0
NA's :255283
The observations ranged from 1 Jan 1980 to 31 Dec 2023. There are 58 rows with missing dates. We should drop these rows since they are unable to tell us which day the observations were made (even if they have temperature readings).
There are 255,645 rows of NAs in daily mean temperature.
There are 255,282 rows of NAs in daily maximum temperature.
There are 255,283 rows of NAs in daily minimum temperature.
We noted that there are a lot of missing values. As the aim of this task is to forecast future temperatures, missing value treatment is not so straight-forward. Imputation using mean, median & mode might hide trends or seasonal patterns whereas removing missing data points altogether might reduce information contained in other features. Let’s understand more about the missing values before we proceed to do imputation for missing values.
First, let us drop those rows where date is missing because we would not be able to definitively identify when the temperature(s) were collected (even if there were temperature readings for these rows.
station tdate mean_temperature maximum_temperature
Length:329098 Min. :1980-01-01 Min. :22.20 Min. :22.80
Class :character 1st Qu.:1997-04-29 1st Qu.:27.10 1st Qu.:30.50
Mode :character Median :2011-09-18 Median :27.90 Median :31.70
Mean :2007-07-02 Mean :27.87 Mean :31.49
3rd Qu.:2017-11-27 3rd Qu.:28.80 3rd Qu.:32.60
Max. :2023-12-31 Max. :31.50 Max. :38.00
NA's :255587 NA's :255224
minimum_temperature
Min. :20.0
1st Qu.:24.3
Median :25.2
Mean :25.3
3rd Qu.:26.3
Max. :30.0
NA's :255225
We noted that there are many weather stations in the temp dataframe. Hence, we will make use of plotly to further explore the missing temperatures.
Let us first explore the daily mean temperatures by selecting the relevant columns and pivot the dataframe wider.
tdate Macritchie Reservoir Lower Peirce Reservoir
Min. :1980-01-01 Min. : NA Min. : NA
1st Qu.:1990-12-31 1st Qu.: NA 1st Qu.: NA
Median :2001-12-31 Median : NA Median : NA
Mean :2001-12-31 Mean :NaN Mean :NaN
3rd Qu.:2012-12-30 3rd Qu.: NA 3rd Qu.: NA
Max. :2023-12-31 Max. : NA Max. : NA
NA's :16071 NA's :16071
Admiralty East Coast Parkway Ang Mo Kio Newton
Min. :22.50 Min. :23.40 Min. :22.50 Min. :22.20
1st Qu.:26.80 1st Qu.:27.50 1st Qu.:26.90 1st Qu.:26.80
Median :27.60 Median :28.20 Median :27.80 Median :27.70
Mean :27.66 Mean :28.13 Mean :27.82 Mean :27.58
3rd Qu.:28.50 3rd Qu.:28.90 3rd Qu.:28.80 3rd Qu.:28.40
Max. :30.80 Max. :30.80 Max. :31.20 Max. :30.60
NA's :10821 NA's :10806 NA's :10918 NA's :11148
Lim Chu Kang Marine Parade Choa Chu Kang (Central) Tuas South
Min. : NA Min. : NA Min. : NA Min. :23.10
1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.:27.40
Median : NA Median : NA Median : NA Median :28.20
Mean :NaN Mean :NaN Mean :NaN Mean :28.19
3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.:29.00
Max. : NA Max. : NA Max. : NA Max. :31.00
NA's :16071 NA's :16071 NA's :16071 NA's :11609
Pasir Panjang Jurong Island Nicoll Highway Botanic Garden
Min. :23.20 Min. :23.40 Min. : NA Min. : NA
1st Qu.:27.50 1st Qu.:27.60 1st Qu.: NA 1st Qu.: NA
Median :28.30 Median :28.40 Median : NA Median : NA
Mean :28.25 Mean :28.29 Mean :NaN Mean :NaN
3rd Qu.:29.10 3rd Qu.:29.10 3rd Qu.: NA 3rd Qu.: NA
Max. :31.30 Max. :31.10 Max. : NA Max. : NA
NA's :11049 NA's :11748 NA's :16071 NA's :16071
Choa Chu Kang (South) Whampoa Changi Jurong Pier
Min. :22.70 Min. : NA Min. :22.80 Min. : NA
1st Qu.:26.80 1st Qu.: NA 1st Qu.:26.90 1st Qu.: NA
Median :27.70 Median : NA Median :27.70 Median : NA
Mean :27.68 Mean :NaN Mean :27.69 Mean :NaN
3rd Qu.:28.60 3rd Qu.: NA 3rd Qu.:28.60 3rd Qu.: NA
Max. :31.00 Max. : NA Max. :30.90 Max. : NA
NA's :11558 NA's :16071 NA's :731 NA's :16071
Ulu Pandan Mandai Tai Seng Jurong (West)
Min. : NA Min. : NA Min. :23.2 Min. :22.20
1st Qu.: NA 1st Qu.: NA 1st Qu.:27.6 1st Qu.:26.60
Median : NA Median : NA Median :28.4 Median :27.40
Mean :NaN Mean :NaN Mean :28.4 Mean :27.41
3rd Qu.: NA 3rd Qu.: NA 3rd Qu.:29.3 3rd Qu.:28.30
Max. : NA Max. : NA Max. :31.5 Max. :30.60
NA's :16071 NA's :16071 NA's :11454 NA's :10995
Clementi Sentosa Island Bukit Panjang Kranji Reservoir
Min. :22.80 Min. :23.00 Min. : NA Min. : NA
1st Qu.:26.90 1st Qu.:27.40 1st Qu.: NA 1st Qu.: NA
Median :27.70 Median :28.20 Median : NA Median : NA
Mean :27.63 Mean :28.12 Mean :NaN Mean :NaN
3rd Qu.:28.50 3rd Qu.:28.90 3rd Qu.: NA 3rd Qu.: NA
Max. :30.60 Max. :31.10 Max. : NA Max. : NA
NA's :11258 NA's :11317 NA's :16071 NA's :16071
Upper Peirce Reservoir Kent Ridge Queenstown Tanjong Katong
Min. : NA Min. : NA Min. : NA Min. : NA
1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
Median : NA Median : NA Median : NA Median : NA
Mean :NaN Mean :NaN Mean :NaN Mean :NaN
3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
Max. : NA Max. : NA Max. : NA Max. : NA
NA's :16071 NA's :16071 NA's :16071 NA's :16071
Somerset (Road) Punggol Simei Toa Payoh
Min. : NA Min. : NA Min. : NA Min. : NA
1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
Median : NA Median : NA Median : NA Median : NA
Mean :NaN Mean :NaN Mean :NaN Mean :NaN
3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
Max. : NA Max. : NA Max. : NA Max. : NA
NA's :16071 NA's :16071 NA's :16071 NA's :16071
Tuas Bukit Timah Pasir Ris (Central)
Min. : NA Min. : NA Min. : NA
1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
Median : NA Median : NA Median : NA
Mean :NaN Mean :NaN Mean :NaN
3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
Max. : NA Max. : NA Max. : NA
NA's :16071 NA's :16071 NA's :16071
We will make use of plotly to explore the missing daily temperatures for each station using a dropdown list.
plot_ly(data = temp_mean_wide,
x = ~tdate,
y = ~ Admiralty,
type = "scatter",
mode = "lines") |>
layout(title = "Temperature observed by Weather Stations",
xaxis = list(title = "Date"),
yaxis = list(title = "", range = c(0,40)),
theme_ipsum_rc(plot_title_size = 13, plot_title_margin=4, subtitle_size=11, subtitle_margin=4,
axis_title_size = 8, axis_text_size=8, axis_title_face= "bold", plot_margin = margin(4, 4, 4, 4)),
updatemenus = list(list(type = 'dropdown',
xref = "paper",
yref = "paper",
xanchor = "left",
x = 0.04,
y = 0.95,
buttons = list(
list(method = "update",
args = list(list(y = list(temp_mean_wide$Admiralty)),
list(yaxis = list(title = "Temperature in Admiralty", range = c(0,40)))),label = "Admiralty"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`East Coast Parkway`)),
list(yaxis = list(title = "Temperature in East Coast Parkway", range = c(0,40)))),label = "East Coast Parkway"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Ang Mo Kio`)),
list(yaxis = list(title = "Temperature in Ang Mo Kio", range = c(0,40)))),label = "Ang Mo Kio"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$Newton)),
list(yaxis = list(title = "Temperature in Newton", range = c(0,40)))),label = "Newton"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Tuas South`)),
list(yaxis = list(title = "Temperature in Tuas South", range = c(0,40)))),label = "Tuas South"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Pasir Panjang`)),
list(yaxis = list(title = "Temperature in Pasir Panjang", range = c(0,40)))),label = "Pasir Panjang"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Jurong Island`)),
list(yaxis = list(title = "Temperature in Jurong Island", range = c(0,40)))),label = "Jurong Island"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Choa Chu Kang (South)`)),
list(yaxis = list(title = "Temperature in Choa Chu Kang (South)", range = c(0,40)))),label = "Choa Chu Kang (South)"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$Changi)),
list(yaxis = list(title = "Temperature in Changi", range = c(0,40)))),label = "Changi"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Tai Seng`)),
list(yaxis = list(title = "Temperature in Tai Seng", range = c(0,40)))),label = "Tai Seng"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Jurong (West)`)),
list(yaxis = list(title = "Temperature in Jurong West", range = c(0,40)))),label = "Jurong West"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$Clementi)),
list(yaxis = list(title = "Temperature in Clementi", range = c(0,40)))),label = "Clementi"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Sentosa Island`)),
list(yaxis = list(title = "Temperature in Sentosa", range = c(0,40)))),label = "Sentosa"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Macritchie Reservoir`)),
list(yaxis = list(title = "Temperature at Macritchie Reservoir", range = c(0,40)))),label = "Macritchie Reservoir"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Lower Peirce Reservoir`)),
list(yaxis = list(title = "Temperature at Lower Peirce Reservoir", range = c(0,40)))),label = "Lower Peirce Reservoir"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Lim Chu Kang`)),
list(yaxis = list(title = "Temperature at Lim Chu Kang", range = c(0,40)))),label = "Lim Chu Kang"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Marine Parade`)),
list(yaxis = list(title = "Temperature at Marine Parade", range = c(0,40)))),label = "Marine Parade"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Choa Chu Kang (Central)`)),
list(yaxis = list(title = "Temperature at Choa Chu Kang (Central)", range = c(0,40)))),label = "Choa Chu Kang (Central)"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Choa Chu Kang (Central)`)),
list(yaxis = list(title = "Temperature at Choa Chu Kang (Central)", range = c(0,40)))),label = "Choa Chu Kang (Central)"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Nicoll Highway`)),
list(yaxis = list(title = "Temperature at Nicoll Highway", range = c(0,40)))),label = "Nicoll Highway"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Botanic Garden`)),
list(yaxis = list(title = "Temperature at Botanic Garden", range = c(0,40)))),label = "Botanic Garden"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$Whampoa)),
list(yaxis = list(title = "Temperature at Whampoa", range = c(0,40)))),label = "Whampoa"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Jurong Pier`)),
list(yaxis = list(title = "Temperature at Jurong Pier", range = c(0,40)))),label = "Jurong Pier"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Ulu Pandan`)),
list(yaxis = list(title = "Temperature at Ulu Pandan", range = c(0,40)))),label = "Ulu Pandan"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$Mandai)),
list(yaxis = list(title = "Temperature at Mandai", range = c(0,40)))),label = "Mandai"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Bukit Panjang`)),
list(yaxis = list(title = "Temperature at Bukit Panjang", range = c(0,40)))),label = "Bukit Panjang"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Kranji Reservoir`)),
list(yaxis = list(title = "Temperature at Kranji Reservoir", range = c(0,40)))),label = "Kranji Reservoir"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Upper Peirce Reservoir`)),
list(yaxis = list(title = "Temperature at Upper Peirce Reservoir", range = c(0,40)))),label = "Upper Peirce Reservoir"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Kent Ridge`)),
list(yaxis = list(title = "Temperature at Kent Ridge", range = c(0,40)))),label = "Kent Ridge"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$Queenstown)),
list(yaxis = list(title = "Temperature at Queenstown", range = c(0,40)))),label = "Queenstown"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Tanjong Katong`)),
list(yaxis = list(title = "Temperature at Tanjong Katong", range = c(0,40)))),label = "Tanjong Katong"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Somerset (Road)`)),
list(yaxis = list(title = "Temperature at Somerset (Road)", range = c(0,40)))),label = "Somerset (Road)"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Punggol`)),
list(yaxis = list(title = "Temperature at Punggol", range = c(0,40)))),label = "Punggol"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Simei`)),
list(yaxis = list(title = "Temperature at Simei", range = c(0,40)))),label = "Simei"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Toa Payoh`)),
list(yaxis = list(title = "Temperature at Toa Payoh", range = c(0,40)))),label = "Toa Payoh"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Tuas`)),
list(yaxis = list(title = "Temperature at Tuas", range = c(0,40)))),label = "Tuas"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Bukit Timah`)),
list(yaxis = list(title = "Temperature at Bukit Timah", range = c(0,40)))),label = "Bukit Timah"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Pasir Ris (Central)`)),
list(yaxis = list(title = "Temperature at Pasir Ris (Central)", range = c(0,40)))),label = "Pasir Ris (Central)")
)))) It seems like there are some weather stations with no temperature data at all. We should remove them from the temp dataframe.
There are some weather stations (e.g. Admiralty) have temperature data only from a certain year onwards (e.g. 2009), and some stations (e.g. Changi) have temperature data as early as 1980s.
For those weather stations with temperature data, they also have missing values over a given time period. So we will need to decide how to handle these missing values in subsequent sections.
Let us identify the amount of missing values for each weather station using the following code chunk.
missing_values <- temp_mean_wide %>%
gather(key = "key", value = "val") %>%
mutate(isna = is.na(val)) %>%
group_by(key) %>%
mutate(total = n()) %>%
group_by(key, total, isna) %>%
summarise(num.isna = n()) %>%
mutate(pct = num.isna / total * 100)
levels <-
(missing_values %>% filter(isna == T) %>% arrange(desc(pct)))$key
percentage_plot <- missing_values %>%
ggplot() +
geom_bar(aes(x = reorder(key, desc(pct)),
y = pct, fill=isna),
stat = 'identity', alpha=0.8) +
scale_x_discrete(limits = levels) +
scale_fill_manual(name = "",
values = c('steelblue', 'tomato3'), labels = c("Present", "Missing")) +
coord_flip() +
labs(title = "Percentage of missing values", x =
'Variable', y = "% of missing values")
percentage_plot
The above output is consistent with what we observed when exploring the data using plotly. There are numerous stations without temperature readings throughout all years and there are certain stations with temperature readings during certain time periods.
Let us find out which stations that have no temperature readings throughout the entire time period using filter().We will filter out those weather stations that have 100% NAs.
[1] "Botanic Garden" "Bukit Panjang"
[3] "Bukit Timah" "Choa Chu Kang (Central)"
[5] "Jurong Pier" "Kent Ridge"
[7] "Kranji Reservoir" "Lim Chu Kang"
[9] "Lower Peirce Reservoir" "Macritchie Reservoir"
[11] "Mandai" "Marine Parade"
[13] "Nicoll Highway" "Pasir Ris (Central)"
[15] "Punggol" "Queenstown"
[17] "Simei" "Somerset (Road)"
[19] "Tanjong Katong" "Toa Payoh"
[21] "Tuas" "Ulu Pandan"
[23] "Upper Peirce Reservoir" "Whampoa"
From the above output, we know that these 24 weather stations have no temperature readings. We will put them into a list and create an operator to exclude them from the temp data using filter().
stationstoremove <- c("Botanic Garden","Bukit Panjang","Bukit Timah","Choa Chu Kang (Central)","Jurong Pier","Kent Ridge", "Kranji Reservoir", "Lim Chu Kang", "Lower Peirce Reservoir", "Macritchie Reservoir","Mandai", "Marine Parade","Nicoll Highway", "Pasir Ris (Central)", "Punggol", "Queenstown","Simei", "Somerset (Road)","Tanjong Katong", "Toa Payoh", "Tuas", "Ulu Pandan", "Upper Peirce Reservoir","Whampoa")
#create a operator to exclude things
'%!in%' <- function(x,y)!('%in%'(x,y))
#excluded stations that have no temp data at all
temp_clean <- temp %>%
filter(station %!in% stationstoremove)
glimpse(temp_clean)Rows: 120,139
Columns: 5
$ station <chr> "Admiralty", "Admiralty", "Admiralty", "Admiralty"…
$ tdate <date> 2009-01-01, 2009-01-02, 2009-01-03, 2009-01-04, 2…
$ mean_temperature <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ maximum_temperature <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ minimum_temperature <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
[1] "Admiralty" "East Coast Parkway" "Ang Mo Kio"
[4] "Newton" "Tuas South" "Pasir Panjang"
[7] "Jurong Island" "Choa Chu Kang (South)" "Changi"
[10] "Tai Seng" "Jurong (West)" "Clementi"
[13] "Sentosa Island"
We will then pivot the temp_clean dataframe wider and plot the daily temperature for the remaining weather stations using plotly.
Rows: 16,071
Columns: 14
$ tdate <date> 2009-01-01, 2009-01-02, 2009-01-03, 2009-01-0…
$ Admiralty <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `East Coast Parkway` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Ang Mo Kio` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Newton <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Tuas South` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Pasir Panjang` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Jurong Island` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Choa Chu Kang (South)` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Changi <dbl> 26.6, 26.4, 26.5, 26.3, 27.0, 27.4, 27.1, 27.0…
$ `Tai Seng` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Jurong (West)` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Clementi <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Sentosa Island` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
plot_ly(data = temp_mean_widec,
x = ~tdate,
y = ~ Admiralty,
type = "scatter",
mode = "lines+markers") |>
layout(title = "Temperature observed by Weather Station",
xaxis = list(title = "Date"),
yaxis = list(title = "", range = c(0,40)),
theme_ipsum_rc(plot_title_size = 13, plot_title_margin=4, subtitle_size=11, subtitle_margin=4,
axis_title_size = 8, axis_text_size=8, axis_title_face= "bold", plot_margin = margin(4, 4, 4, 4)),
updatemenus = list(list(type = 'dropdown',
xref = "paper",
yref = "paper",
xanchor = "left",
x = 0.04,
y = 0.95,
buttons = list(
list(method = "update",
args = list(list(y = list(temp_mean_widec$Admiralty)),
list(yaxis = list(title = "Temperature in Admiralty", range = c(0,40)))),label = "Admiralty"),
list(method = "update",
args = list(list(y = list(temp_mean_widec$`East Coast Parkway`)),
list(yaxis = list(title = "Temperature in East Coast Parkway", range = c(0,40)))),label = "East Coast Parkway"),
list(method = "update",
args = list(list(y = list(temp_mean_widec$`Ang Mo Kio`)),
list(yaxis = list(title = "Temperature in Ang Mo Kio", range = c(0,40)))),label = "Ang Mo Kio"),
list(method = "update",
args = list(list(y = list(temp_mean_widec$Newton)),
list(yaxis = list(title = "Temperature in Newton", range = c(0,40)))),label = "Newton"),
list(method = "update",
args = list(list(y = list(temp_mean_widec$`Tuas South`)),
list(yaxis = list(title = "Temperature in Tuas South", range = c(0,40)))),label = "Tuas South"),
list(method = "update",
args = list(list(y = list(temp_mean_widec$`Pasir Panjang`)),
list(yaxis = list(title = "Temperature in Pasir Panjang", range = c(0,40)))),label = "Pasir Panjang"),
list(method = "update",
args = list(list(y = list(temp_mean_widec$`Jurong Island`)),
list(yaxis = list(title = "Temperature in Jurong Island", range = c(0,40)))),label = "Jurong Island"),
list(method = "update",
args = list(list(y = list(temp_mean_widec$`Choa Chu Kang (South)`)),
list(yaxis = list(title = "Temperature in Choa Chu Kang", range = c(0,40)))),label = "Choa Chu Kang"),
list(method = "update",
args = list(list(y = list(temp_mean_widec$Changi)),
list(yaxis = list(title = "Temperature in Changi", range = c(0,40)))),label = "Changi"),
list(method = "update",
args = list(list(y = list(temp_mean_widec$`Tai Seng`)),
list(yaxis = list(title = "Temperature in Tai Seng", range = c(0,40)))),label = "Tai Seng"),
list(method = "update",
args = list(list(y = list(temp_mean_widec$`Jurong (West)`)),
list(yaxis = list(title = "Temperature in Jurong West", range = c(0,40)))),label = "Jurong West"),
list(method = "update",
args = list(list(y = list(temp_mean_widec$Clementi)),
list(yaxis = list(title = "Temperature in Clementi", range = c(0,40)))),label = "Clementi"),
list(method = "update",
args = list(list(y = list(temp_mean_widec$`Sentosa Island`)),
list(yaxis = list(title = "Temperature in Sentosa", range = c(0,40)))),label = "Sentosa")
)))) From the above interactive chart, we note that some stations have a longer time period with temperature readings (e.g. Changi). Almost all stations have some missing time gaps/ values in the data, hence we will need to do imputation for this missing values to ensure better accuracy of our temperature forecasting.
Also, we noted that daily temperature readings that range more than 20 years is too frequent for time series forecasting. Hence, we will aggregate the daily temperature readings to monthly temperature readings by calculating the mean in subsequent section.
In the previous sections, we noted that the dataframes were all tibble dataframes. For us to make use of the time series forecasting packages and their functions, we would need to convert the tibble dataframe into a time series object.
Before we create the time series object, let us first aggregate the daily temperature readings to monthly temperature readings by (1) creating the year-month column for each observation using floor_date() and specifying it to derive the year and month of each observation, and (2) aggregate the temperature readings by station and year_month then use summarise() to compute the monthly averages for mean_temperature, maximum_temperature and minimum_temperature.
Rows: 120,139
Columns: 6
$ station <chr> "Admiralty", "Admiralty", "Admiralty", "Admiralty"…
$ tdate <date> 2009-01-01, 2009-01-02, 2009-01-03, 2009-01-04, 2…
$ mean_temperature <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ maximum_temperature <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ minimum_temperature <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ year_month <date> 2009-01-01, 2009-01-01, 2009-01-01, 2009-01-01, 2…
Rows: 3,947
Columns: 5
Groups: station [13]
$ station <chr> "Admiralty", "Admiralty", "Admiralty", "Admiralty"…
$ year_month <date> 2009-01-01, 2009-02-01, 2009-03-01, 2009-04-01, 2…
$ mean_temperature <dbl> NA, 26.76786, NA, 28.12000, 28.48387, 28.89667, 28…
$ maximum_temperature <dbl> NA, 31.44286, NA, 32.19667, 32.59032, 32.87000, 31…
$ minimum_temperature <dbl> NA, 24.26071, NA, 25.06667, 25.09355, 25.95000, 25…
With the monthly temperature of all weather stations, let us filter out one weather station (e.g. Admiralty) to create a tibble data frame adm so that we can convert it into an xts object, which is a type of time series object.
station year_month mean_temperature maximum_temperature
Length:179 Min. :2009-01-01 Min. :25.61 Min. :28.86
Class :character 1st Qu.:2012-09-16 1st Qu.:27.10 1st Qu.:31.33
Mode :character Median :2016-07-01 Median :27.74 Median :31.85
Mean :2016-06-19 Mean :27.68 Mean :31.83
3rd Qu.:2020-03-16 3rd Qu.:28.29 3rd Qu.:32.45
Max. :2023-12-01 Max. :29.15 Max. :33.87
NA's :22 NA's :18
minimum_temperature
Min. :23.63
1st Qu.:24.52
Median :24.92
Mean :24.98
3rd Qu.:25.46
Max. :26.45
NA's :18
We will use xts() from xts package to create a time series object. The order.by parameter uses the dates from the adm dataframe. We then use the ts_regular() function to give the time series object adm_xts a regular interval by adding NA values for missing dates.
Just in case there are missing months which we did not detected, we use the na.fill() function fills in those missing dates by extending values from previous days.
Let us plot out the monthly mean temperature of Admiralty weather station using ggplotly.
From the above output, we see that there are missing temperatures for numerous time periods. As a result, the line for the above chart is not continuous.
Let us investigate further using imputeTS package’s ggplot_na_distribution, which highlights the missing values in our data. For the following example, we focus on the mean temperature of the adm time series object.

We also use the imputeTS package’s statsNA to have a report on the number of missing mean temperature readings.
[1] "Length of time series:"
[1] 180
[1] "-------------------------"
[1] "Number of Missing Values:"
[1] 23
[1] "-------------------------"
[1] "Percentage of Missing Values:"
[1] "12.8%"
[1] "-------------------------"
[1] "Number of Gaps:"
[1] 12
[1] "-------------------------"
[1] "Average Gap Size:"
[1] 1.916667
[1] "-------------------------"
[1] "Stats for Bins"
[1] " Bin 1 (45 values from 1 to 45) : 3 NAs (6.67%)"
[1] " Bin 2 (45 values from 46 to 90) : 4 NAs (8.89%)"
[1] " Bin 3 (45 values from 91 to 135) : 11 NAs (24.4%)"
[1] " Bin 4 (45 values from 136 to 180) : 5 NAs (11.1%)"
[1] "-------------------------"
[1] "Longest NA gap (series of consecutive NAs)"
[1] "6 in a row"
[1] "-------------------------"
[1] "Most frequent gap size (series of consecutive NA series)"
[1] "1 NA in a row (occurring 7 times)"
[1] "-------------------------"
[1] "Gap size accounting for most NAs"
[1] "1 NA in a row (occurring 7 times, making up for overall 7 NAs)"
[1] "-------------------------"
[1] "Overview NA series"
[1] " 1 NA in a row: 7 times"
[1] " 2 NA in a row: 2 times"
[1] " 3 NA in a row: 2 times"
[1] " 6 NA in a row: 1 times"
There are several ways to impute missing data in time series objects. We need to impute missing values because some of the models cannot handle NAs in Time Series objects.
As this function calculates moving averages based on the last n observations, it will generally be performing better than using mean, mode and median imputation. Moving averages work well when data has a linear trend. This function also allows us to use linear-weighted and exponentially-weighted moving averages.
adm_imp_movingavg <- na_ma(adm_xts, weighting = "exponential") #default is exponential. Other options are "simple" and "linear". We can allow users to choose if the option they want.
#plot chart
#ggplot(adm_imp_movingavg, aes(x = Index, y = mean_temperature)) +
#geom_line()
plot_ma<- ggplot(adm_imp_movingavg, aes(x = Index, y = mean_temperature)) +
geom_line() + theme_clean() +
labs(title = "Monthly Mean Temperature of Admiralty Weather Station \n(missing values imputed using moving average)") +
xlab("Month-Year") +
ylab("Temperature in degrees celsius") +
theme_ipsum_rc()
ggplotly(plot_ma)We can also use Kalman Smoothing on ARIMA model to impute the missing values.

From the above output, we see that some of the imputed values are below 0 degrees celsius which is impossible in Singapore. As such, we will not be using this method to impute missing values for temperature readings.
Kalman Smoothing also has a “StrucTS” option. Let us try and see how it works for our temperature data.

From the above output, it seems like using the “StrucTS” model works better than auto.arima model since the imputed results were reasonable. Again, we can also let users choose which model they want to use.
Before we model the time series forecasting model, let us test is our time series data is stationary. Stationarity signifies that the statistical properties of time series, such as mean, variance, and covariance, remain constant over time, which is the fundamental assumption for many time series modeling techniques.It simplifies the complex dynamics within the data, making it more amenable to analysis, modeling, and forecasting.
There are two tests we are use to test for stationarity: - Augmented Dickey-Fuller (ADF) Test; and - Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test
Null Hypothesis: Series is non-stationary, or series has a unit root. Alternative Hypothesis: Series is stationary, or series has no unit root.
If the null hypothesis fails to be rejected, this test may provide evidence that the series is non-stationary.
Augmented Dickey-Fuller Test
data: adm_imp_movingavg$mean_temperature
Dickey-Fuller = -7.3405, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary
Since the p-value is 0.01, which is less than critical value of 0.05, we reject the null hypothesis. This means that the time series does not have a unit root, meaning it is stationary. It does not have a time-dependent structure.
Null Hypothesis: Series is trend stationary or series has no unit root. Alternative Hypothesis: Series is non-stationary, or series has a unit root.
Note: The hypothesis is reversed in the KPSS test compared to ADF Test.
KPSS Test for Level Stationarity
data: adm_imp_movingavg$mean_temperature
KPSS Level = 0.086074, Truncation lag parameter = 4, p-value = 0.1
Since the p-value is 0.1, which is greater than the critical value of 0.05, we fail to reject the null hypothesis of the KPSS test.This means we can assume that the time series is trend stationary.
Both ADF and KPSS tests conclude that the given series is stationary. This means that we can make use of most of the time series forecasting models such as Exponential Smoothing and ARIMA.
Time series data can exhibit a variety of patterns, and it is often helpful to split a time series into several component to help us improve our understanding of the time series and forecast accuracy.
First, let us plot the monthly mean temperature of the Admiralty weather station.
From the above output, it seems like there were fluctuations in monthly mean temperature but there was no increasing trend.
It also seems like for each year, the mean temperature would usually be the highest in May/Jun of each year as indicated by the peaks. Also, for each year, the lowest mean temperature would usually be around Dec/ Jan.
To find out if there is a seasonality, trend and cycle, we can decompose a time series object using stl()from xts package. STL is a versatile and robust method for decomposing time series. STL is an acronym for “Seasonal and Trend decomposition using Loess”, while Loess is a method for estimating nonlinear relationships. The STL method was developed by R. B. Cleveland, Cleveland, McRae, & Terpenning (1990).
In the following code chunk, we use: - stl() to decompose the time series object - ts_ts() function from the library converts an xts field to a ts object that can be used with stl().

Call:
stl(x = ts_ts(adm_imp_movingavg$mean_temperature), s.window = "periodic")
Time.series components:
seasonal trend remainder
Min. :-0.9176109 Min. :27.30177 Min. :-0.9341622
1st Qu.:-0.5109855 1st Qu.:27.43056 1st Qu.:-0.2329592
Median : 0.1857624 Median :27.65282 Median :-0.0124042
Mean : 0.0000000 Mean :27.66756 Mean :-0.0019506
3rd Qu.: 0.4489919 3rd Qu.:27.85388 3rd Qu.: 0.2320894
Max. : 0.7298073 Max. :28.19072 Max. : 0.9402058
IQR:
STL.seasonal STL.trend STL.remainder data
0.9600 0.4233 0.4650 1.0218
% 94.0 41.4 45.5 100.0
Weights: all == 1
Other components: List of 5
$ win : Named num [1:3] 1801 19 13
$ deg : Named int [1:3] 0 1 1
$ jump : Named num [1:3] 181 2 2
$ inner: int 2
$ outer: int 0
From the above output, we noted that there is no clear linear trend for the monthly mean temperature over the years. However, we do note that the monthly mean temperature ranges from 27.3 degrees celsius to ~28.2 degree celsius.
We observed seasonality in the monthly mean temperature over the years.
First we will split the data into training and validation data.
When choosing models, it is common practice to separate the available data into two portions, training and test data, where the training data is used to estimate any parameters of a forecasting method and the test data is used to evaluate its accuracy. Because the test data is not used in determining the forecasts, it should provide a reliable indication of how well the model is likely to forecast on new data.
The size of the test set is typically about 20% of the total sample, although this value depends on how long the sample is and how far ahead you want to forecast. The test set should ideally be at least as large as the maximum forecast horizon required.
For this section, we will set the test set ~20% of the dataframe we have. However when building the Shiny dashboard, we should allow user input on the duration they want to forecast (e.g. next few months or next few years) because this would affect the size of the test dataset.
Some forecasting methods are extremely simple and surprisingly effective. We will use the following two forecasting methods (i.e. naive and seasonal naive) as benchmarks.
For naïve forecasts, we simply set all forecasts to be the value of the last observation.
Mean absolute percentage error (MAPE) is the percentage equivalent of mean absolute error (MAE). Mean absolute percentage error measures the average magnitude of error produced by a model, or how far off predictions are on average.
To measure the performance of how well the model’s forecasted values as compared to the test dataset, we use MAPE() of MLmetrics package to calculate the MAPE.
From the above output, we have a MAPE of 2.79% meaning that the average difference between the forecasted value and the actual value is 2.79%.For a simple model, this forecasting accuracy is very good! It means that other models introduced would need to have a even lower MAPE in order for us to consider them.
A similar method is useful for highly seasonal data. In this case, we set each forecast to be equal to the last observed value from the same season (e.g.,the same month of the previous year).
From the above output, we have a MAPE of 2.07% meaning that the average difference between the forecasted value and the actual value is 2.07%.For a simple model, this forecasting accuracy is even better than the naive model! It means that other models introduced would need to have a even lower MAPE in order for us to consider them.
The simplest of the exponentially smoothing methods is naturally called simple exponential smoothing (SES). This method is suitable for forecasting data with no clear trend or seasonal pattern.
From the above output, we have a MAPE of 2.78% meaning that the average difference between the forecasted value and the actual value is 2.78%, which is slightly better than the naive model and poorer performance than the seasonal naive model.
State space models provide a flexible framework for modeling time series data. They consist of two components: the state equation and the observation equation. The state equation describes how the underlying states of the system evolve over time, while the observation equation relates the observed data to the underlying states.
State space models allow us to capture complex dynamics and dependencies in the data, making them suitable for a wide range of applications, including finance, economics and engineering.
We use ets() from forecast package to find out the optimal model.
ETS(M,N,A)
Call:
ets(y = trainingtemp)
Smoothing parameters:
alpha = 0.2716
gamma = 1e-04
Initial states:
l = 27.8746
s = -0.8262 -0.4477 0.1394 0.235 0.4386 0.5042
0.6772 0.6073 0.3135 -0.0647 -0.631 -0.9455
sigma: 0.0158
AIC AICc BIC
544.5750 548.0035 590.3228
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -0.005080769 0.4168655 0.336926 -0.03791676 1.219642 0.7034614
ACF1
Training set 0.1037638
We see ETS (M,N,A). This means we have an ets model with multiplicative errors, no trend and a additive seasonality. Additive seasonality means there aren’t any changes to widths or heights of seasonal periods over time.
From the above output, we have a MAPE of 1.47% meaning that the average difference between the forecasted value and the actual value is 1.47%, which is so far the best performing model.
Since time series analysis decomposes past weather observations into seasonal, trend, and random components, that data can be used to create forecasts based on an assumption that the observed patterns will continue into the future. This type of forecasting is especially useful in business for anticipating potential demand for seasonal products.
To forecast future temperatures based on historical observations, we can use Holt-Winters model that considers past seasonal cycles, trends, and random variation.
Note that for Holt-Winters’ method, we can choose additive or multiplicative seasonality to forecast the monthly mean temperature, which can be part of the user input.
From the above output, we have a MAPE of 1.47% meaning that the average difference between the forecasted value and the actual value is 1.47%, which gives us as good performance as the State Space Model.
From the above output, we have a MAPE of 1.53% meaning that the average difference between the forecasted value and the actual value is 1.53%, which is relatively slightly poorer than the Holt-Winters’ Additive Seasonality Model.
ARIMA models provide another approach to time series forecasting. While exponential smoothing models are based on a description of the trend and seasonality in the data, auto regressive integrated moving average (ARIMA) modeling involves a more detailed analysis of the training data using lags and lagged forecast errors.
The first step is to use a function like auto.arima() to analyze the data and find appropriate model configuration parameters.
Series: trainingtemp
ARIMA(1,0,1)(2,1,1)[12] with drift
Coefficients:
ar1 ma1 sar1 sar2 sma1 drift
0.7912 -0.4842 0.0130 -0.1503 -0.8839 0.0013
s.e. 0.1180 0.1735 0.1349 0.1298 0.1996 0.0018
sigma^2 = 0.1948: log likelihood = -93.88
AIC=201.76 AICc=202.59 BIC=222.55
The function returned the following model: ARIMA (1,0,1)(2,1,1)[12] with drift.
From the above output, we have a MAPE of 1.55% meaning that the average difference between the forecasted value and the actual value is 1.55%, which gives us better performance than the benchmark models.
| Model | MAPE (%) |
|---|---|
| Naive | 2.79 |
| Seasonal Naive | 2.07 |
| Simple Exponential Smoothing | 2.78 |
| State Space Model | 1.47 |
| Holt-Winters’ Model (Additive Seasonality) | 1.47 |
| Holt-Winters’ Model (Multiplicative Seasonality) | 1.53 |
| ARIMA | 1.55 |
From the above table, the state space model, Holt-Winters’ model and ARIMA model all outperformed the benchmark models (i.e. naive Model and Seasonal Naive Model) for temperature data. We can consider letting users to choose to use these models when forecasting temperature data.
In this section, we will be cleaning the rainfall data and building models for the rainfall data.
First we select the relevant rainfall related columns needed for this exercise using select(). We will examine the resultant dataframe rainfall using str().
rainfall) is a tibble dataframe with the following columns:
tdate: refers to the date of the rainfall reading is collected.station: refers to the weather station that collected this rainfall reading.daily_rainfall_total: refers to the total amount of rainfall observed by this weather station in a day. The unit of measurement is in mm.Let us save the dataframe into an RDS file for future usage using write_rds().
We will bring in the rainfall data using read_rds(). Let us check the imported RDS data using str() again.
tibble [329,156 × 3] (S3: tbl_df/tbl/data.frame)
$ tdate : Date[1:329156], format: "1980-01-01" "1980-01-02" ...
$ station : chr [1:329156] "Macritchie Reservoir" "Macritchie Reservoir" "Macritchie Reservoir" "Macritchie Reservoir" ...
$ daily_rainfall_total: num [1:329156] 0 0 0 0 22.6 49.6 2.4 0 0 0 ...
rainfall is a tibble dataframe.station is character data type, tdate is date data type and daily_rainfall_total is numeric data.First, let us use summary() to check for missing data.
tdate station daily_rainfall_total
Min. :1980-01-01 Length:329156 Min. : 0.000
1st Qu.:1997-04-29 Class :character 1st Qu.: 0.000
Median :2011-09-18 Mode :character Median : 0.200
Mean :2007-07-02 Mean : 6.822
3rd Qu.:2017-11-27 3rd Qu.: 6.500
Max. :2023-12-31 Max. :278.600
NA's :58 NA's :5136
The observations ranged from 1 Jan 1980 to 31 Dec 2023. There are 58 rows with missing dates. We should drop these rows since they are unable to tell us which day the observations were made (even if they have rainfall readings).
There are 5,136 rows of NAs for daily rainfall.
First, let us drop those rows where date is missing because we would not be able to definitively identify when the temperature(s) were collected (even if there were temperature readings for these rows.
tdate station daily_rainfall_total
Min. :1980-01-01 Length:329098 Min. : 0.000
1st Qu.:1997-04-29 Class :character 1st Qu.: 0.000
Median :2011-09-18 Mode :character Median : 0.200
Mean :2007-07-02 Mean : 6.822
3rd Qu.:2017-11-27 3rd Qu.: 6.500
Max. :2023-12-31 Max. :278.600
NA's :5078
Let us also do a check of the weather stations.
[1] "Macritchie Reservoir" "Lower Peirce Reservoir"
[3] "Admiralty" "East Coast Parkway"
[5] "Ang Mo Kio" "Newton"
[7] "Lim Chu Kang" "Marine Parade"
[9] "Choa Chu Kang (Central)" "Tuas South"
[11] "Pasir Panjang" "Jurong Island"
[13] "Nicoll Highway" "Botanic Garden"
[15] "Choa Chu Kang (South)" "Whampoa"
[17] "Changi" "Jurong Pier"
[19] "Ulu Pandan" "Mandai"
[21] "Tai Seng" "Jurong (West)"
[23] "Clementi" "Sentosa Island"
[25] "Bukit Panjang" "Kranji Reservoir"
[27] "Upper Peirce Reservoir" "Kent Ridge"
[29] "Queenstown" "Tanjong Katong"
[31] "Somerset (Road)" "Punggol"
[33] "Simei" "Toa Payoh"
[35] "Tuas" "Bukit Timah"
[37] "Pasir Ris (Central)"
From the previous section, we noted that there are many weather stations in the rainfall dataframe. Hence, we will make use of plotly to further explore the missing temperatures.
First, we will pivot the dataframe wider.
tdate Macritchie Reservoir Lower Peirce Reservoir
Min. :1980-01-01 Min. : 0.000 Min. : 0.000
1st Qu.:1990-12-31 1st Qu.: 0.000 1st Qu.: 0.000
Median :2001-12-31 Median : 0.200 Median : 0.400
Mean :2001-12-31 Mean : 7.145 Mean : 7.706
3rd Qu.:2012-12-30 3rd Qu.: 7.100 3rd Qu.: 8.200
Max. :2023-12-31 Max. :256.000 Max. :227.600
NA's :121 NA's :11141
Admiralty East Coast Parkway Ang Mo Kio Newton
Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
Median : 0.400 Median : 0.000 Median : 0.400 Median : 0.200
Mean : 6.735 Mean : 4.977 Mean : 7.218 Mean : 6.625
3rd Qu.: 6.800 3rd Qu.: 3.800 3rd Qu.: 7.800 3rd Qu.: 6.800
Max. :142.000 Max. :192.600 Max. :164.400 Max. :150.400
NA's :10785 NA's :10778 NA's :10901 NA's :11112
Lim Chu Kang Marine Parade Choa Chu Kang (Central) Tuas South
Min. : 0.000 Min. : 0.00 Min. : 0.00 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
Median : 0.400 Median : 0.00 Median : 0.40 Median : 0.200
Mean : 6.759 Mean : 5.61 Mean : 7.51 Mean : 6.892
3rd Qu.: 7.000 3rd Qu.: 4.95 3rd Qu.: 8.60 3rd Qu.: 6.200
Max. :158.600 Max. :195.40 Max. :144.20 Max. :208.200
NA's :11214 NA's :10933 NA's :10994 NA's :11472
Pasir Panjang Jurong Island Nicoll Highway Botanic Garden
Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
Median : 0.20 Median : 0.000 Median : 0.200 Median : 0.200
Mean : 6.29 Mean : 6.287 Mean : 6.548 Mean : 7.283
3rd Qu.: 6.00 3rd Qu.: 5.600 3rd Qu.: 6.200 3rd Qu.: 7.800
Max. :151.60 Max. :185.200 Max. :163.800 Max. :160.400
NA's :11031 NA's :11582 NA's :11269 NA's :11228
Choa Chu Kang (South) Whampoa Changi Jurong Pier
Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.000
Median : 0.400 Median : 0.200 Median : 0.00 Median : 0.200
Mean : 7.469 Mean : 6.888 Mean : 5.81 Mean : 7.131
3rd Qu.: 8.200 3rd Qu.: 7.100 3rd Qu.: 4.40 3rd Qu.: 7.000
Max. :143.800 Max. :154.600 Max. :216.20 Max. :226.400
NA's :11491 NA's :11606 NA's :275
Ulu Pandan Mandai Tai Seng Jurong (West)
Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
Median : 0.200 Median : 0.400 Median : 0.000 Median : 0.300
Mean : 7.201 Mean : 7.218 Mean : 6.757 Mean : 7.224
3rd Qu.: 6.900 3rd Qu.: 7.150 3rd Qu.: 5.900 3rd Qu.: 7.000
Max. :230.400 Max. :247.200 Max. :217.200 Max. :226.200
NA's :355 NA's :828 NA's :5 NA's :445
Clementi Sentosa Island Bukit Panjang Kranji Reservoir
Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
Median : 0.300 Median : 0.000 Median : 0.400 Median : 0.300
Mean : 7.216 Mean : 6.166 Mean : 7.316 Mean : 7.041
3rd Qu.: 7.200 3rd Qu.: 5.100 3rd Qu.: 7.400 3rd Qu.: 7.100
Max. :239.500 Max. :220.800 Max. :235.600 Max. :239.800
NA's :207 NA's :365 NA's :227 NA's :234
Upper Peirce Reservoir Kent Ridge Queenstown Tanjong Katong
Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
Median : 0.400 Median : 0.200 Median : 0.200 Median : 0.100
Mean : 7.527 Mean : 7.365 Mean : 6.805 Mean : 6.136
3rd Qu.: 7.900 3rd Qu.: 7.400 3rd Qu.: 6.100 3rd Qu.: 5.100
Max. :202.800 Max. :179.600 Max. :278.600 Max. :226.000
NA's :11168 NA's :10858 NA's :773 NA's :392
Somerset (Road) Punggol Simei Toa Payoh
Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
Median : 0.200 Median : 0.200 Median : 0.000 Median : 0.200
Mean : 6.387 Mean : 6.594 Mean : 5.987 Mean : 7.211
3rd Qu.: 6.200 3rd Qu.: 6.400 3rd Qu.: 5.200 3rd Qu.: 7.600
Max. :155.800 Max. :168.800 Max. :182.600 Max. :150.200
NA's :11427 NA's :10767 NA's :11013 NA's :11037
Tuas Bukit Timah Pasir Ris (Central)
Min. : 0.000 Min. : 0.00 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.000
Median : 0.200 Median : 0.40 Median : 0.200
Mean : 7.198 Mean : 7.13 Mean : 6.165
3rd Qu.: 7.600 3rd Qu.: 7.40 3rd Qu.: 5.400
Max. :217.000 Max. :156.80 Max. :185.800
NA's :10690 NA's :10826 NA's :11057
We will make use of plotly to explore the daily rainfall for each station using a dropdown list.
plot_ly(data = rain_wide,
x = ~tdate,
y = ~ Admiralty,
type = "scatter",
mode = "lines") |>
layout(title = "Total Rain Fall observed by Weather Station",
xaxis = list(title = "Date", range(as.Date("1980-01-01"), as.Date("2023-12-31"))),
yaxis = list(title = ""),
theme_ipsum_rc(plot_title_size = 13, plot_title_margin=4, subtitle_size=11, subtitle_margin=4,
axis_title_size = 8, axis_text_size=8, axis_title_face= "bold", plot_margin = margin(4, 4, 4, 4)),
updatemenus = list(list(type = 'dropdown',
xref = "paper",
yref = "paper",
xanchor = "left",
x = 0.04,
y = 0.95,
buttons = list(
list(method = "update",
args = list(list(y = list(rain_wide$Admiralty)),
list(yaxis = list(title = "Total Rainfall observed in Admiralty"))),label = "Admiralty"),
list(method = "update",
args = list(list(y = list(rain_wide$`East Coast Parkway`)),
list(yaxis = list(title = "Total Rainfall observed in East Coast Parkway"))),label = "East Coast Parkway"),
list(method = "update",
args = list(list(y = list(rain_wide$`Ang Mo Kio`)),
list(yaxis = list(title = "Total Rainfall observed in Ang Mo Kio"))),label = "Ang Mo Kio"),
list(method = "update",
args = list(list(y = list(rain_wide$Newton)),
list(yaxis = list(title = "Total Rainfall observed in Newton"))),label = "Newton"),
list(method = "update",
args = list(list(y = list(rain_wide$`Tuas South`)),
list(yaxis = list(title = "Total Rainfall observed in Tuas South"))),label = "Tuas South"),
list(method = "update",
args = list(list(y = list(rain_wide$`Pasir Panjang`)),
list(yaxis = list(title = "Total Rainfall observed in Pasir Panjang"))),label = "Pasir Panjang"),
list(method = "update",
args = list(list(y = list(rain_wide$`Jurong Island`)),
list(yaxis = list(title = "Total Rainfall observed in Jurong Island"))),label = "Jurong Island"),
list(method = "update",
args = list(list(y = list(rain_wide$`Choa Chu Kang (South)`)),
list(yaxis = list(title = "Total Rainfall observed in Choa Chu Kang (South)"))),label = "Choa Chu Kang (South)"),
list(method = "update",
args = list(list(y = list(rain_wide$Changi)),
list(yaxis = list(title = "Total Rainfall observed in Changi"))),label = "Changi"),
list(method = "update",
args = list(list(y = list(rain_wide$`Tai Seng`)),
list(yaxis = list(title = "Total Rainfall observed in Tai Seng"))),label = "Tai Seng"),
list(method = "update",
args = list(list(y = list(rain_wide$`Jurong (West)`)),
list(yaxis = list(title = "Total Rainfall observed in Jurong West"))),label = "Jurong West"),
list(method = "update",
args = list(list(y = list(rain_wide$Clementi)),
list(yaxis = list(title = "Total Rainfall observed in Clementi"))),label = "Clementi"),
list(method = "update",
args = list(list(y = list(rain_wide$`Sentosa Island`)),
list(yaxis = list(title = "Total Rainfall observed in Sentosa"))),label = "Sentosa"),
list(method = "update",
args = list(list(y = list(rain_wide$`Macritchie Reservoir`)),
list(yaxis = list(title = "Total Rainfall observed at Macritchie Reservoir"))),label = "Macritchie Reservoir"),
list(method = "update",
args = list(list(y = list(rain_wide$`Lower Peirce Reservoir`)),
list(yaxis = list(title = "Total Rainfall observed at Lower Peirce Reservoir"))),label = "Lower Peirce Reservoir"),
list(method = "update",
args = list(list(y = list(rain_wide$`Lim Chu Kang`)),
list(yaxis = list(title = "Total Rainfall observed at Lim Chu Kang"))),label = "Lim Chu Kang"),
list(method = "update",
args = list(list(y = list(rain_wide$`Marine Parade`)),
list(yaxis = list(title = "Total Rainfall observed at Marine Parade"))),label = "Marine Parade"),
list(method = "update",
args = list(list(y = list(rain_wide$`Choa Chu Kang (Central)`)),
list(yaxis = list(title = "Total Rainfall observed at Choa Chu Kang (Central)"))),label = "Choa Chu Kang (Central)"),
list(method = "update",
args = list(list(y = list(rain_wide$`Nicoll Highway`)),
list(yaxis = list(title = "Total Rainfall observed at Nicoll Highway"))),label = "Nicoll Highway"),
list(method = "update",
args = list(list(y = list(rain_wide$`Botanic Garden`)),
list(yaxis = list(title = "Total Rainfall observed at Botanic Garden"))),label = "Botanic Garden"),
list(method = "update",
args = list(list(y = list(rain_wide$Whampoa)),
list(yaxis = list(title = "Total Rainfall observed at Whampoa"))),label = "Whampoa"),
list(method = "update",
args = list(list(y = list(rain_wide$`Jurong Pier`)),
list(yaxis = list(title = "Total Rainfall observed at Jurong Pier"))),label = "Jurong Pier"),
list(method = "update",
args = list(list(y = list(rain_wide$`Ulu Pandan`)),
list(yaxis = list(title = "Total Rainfall observed at Ulu Pandan"))),label = "Ulu Pandan"),
list(method = "update",
args = list(list(y = list(rain_wide$Mandai)),
list(yaxis = list(title = "Total Rainfall observed at Mandai"))),label = "Mandai"),
list(method = "update",
args = list(list(y = list(rain_wide$`Bukit Panjang`)),
list(yaxis = list(title = "Total Rainfall observed at Bukit Panjang"))),label = "Bukit Panjang"),
list(method = "update",
args = list(list(y = list(rain_wide$`Kranji Reservoir`)),
list(yaxis = list(title = "Total Rainfall observed at Kranji Reservoir"))),label = "Kranji Reservoir"),
list(method = "update",
args = list(list(y = list(rain_wide$`Upper Peirce Reservoir`)),
list(yaxis = list(title = "Total Rainfall observed at Upper Peirce Reservoir"))),label = "Upper Peirce Reservoir"),
list(method = "update",
args = list(list(y = list(rain_wide$`Kent Ridge`)),
list(yaxis = list(title = "Total Rainfall observed at Kent Ridge"))),label = "Kent Ridge"),
list(method = "update",
args = list(list(y = list(rain_wide$Queenstown)),
list(yaxis = list(title = "Total Rainfall observed at Queenstown"))),label = "Queenstown"),
list(method = "update",
args = list(list(y = list(rain_wide$`Tanjong Katong`)),
list(yaxis = list(title = "Total Rainfall observed at Tanjong Katong"))),label = "Tanjong Katong"),
list(method = "update",
args = list(list(y = list(rain_wide$`Somerset (Road)`)),
list(yaxis = list(title = "Total Rainfall observed at Somerset (Road)"))),label = "Somerset (Road)"),
list(method = "update",
args = list(list(y = list(rain_wide$`Punggol`)),
list(yaxis = list(title = "Total Rainfall observed at Punggol"))),label = "Punggol"),
list(method = "update",
args = list(list(y = list(rain_wide$`Simei`)),
list(yaxis = list(title = "Total Rainfall observed at Simei"))),label = "Simei"),
list(method = "update",
args = list(list(y = list(rain_wide$`Toa Payoh`)),
list(yaxis = list(title = "Total Rainfall observed at Toa Payoh"))),label = "Toa Payoh"),
list(method = "update",
args = list(list(y = list(rain_wide$`Tuas`)),
list(yaxis = list(title = "Total Rainfall observed at Tuas"))),label = "Tuas"),
list(method = "update",
args = list(list(y = list(rain_wide$`Bukit Timah`)),
list(yaxis = list(title = "Total Rainfall observed at Bukit Timah"))),label = "Bukit Timah"),
list(method = "update",
args = list(list(y = list(rain_wide$`Pasir Ris (Central)`)),
list(yaxis = list(title = "Total Rainfall observed at Pasir Ris (Central)"))),label = "Pasir Ris (Central)")
)))) Let us find out the amount of missing values for each weather station using the following code chunk.
missing_values <- rain_wide %>%
gather(key = "key", value = "val") %>%
mutate(isna = is.na(val)) %>%
group_by(key) %>%
mutate(total = n()) %>%
group_by(key, total, isna) %>%
summarise(num.isna = n()) %>%
mutate(pct = num.isna / total * 100)
levels <-
(missing_values %>% filter(isna == T) %>% arrange(desc(pct)))$key
percentage_plot <- missing_values %>%
ggplot() +
geom_bar(aes(x = reorder(key, desc(pct)),
y = pct, fill=isna),
stat = 'identity', alpha=0.8) +
scale_x_discrete(limits = levels) +
scale_fill_manual(name = "",
values = c('steelblue', 'tomato3'), labels = c("Present", "Missing")) +
coord_flip() +
labs(title = "Percentage of missing values", x =
'Variable', y = "% of missing values")Let us check if there are any stations where they have 100% missing values.
character(0)
From the above output, it seems like no stations have 100% missing values.
Let us check if there are any stations with no missing values.
[1] "Changi" "tdate"
From the above output, it seems like only Changi weather station has no missing values. This means that for other weather stations there are some amount of missing data for each weather station. Let us impute the missing values in the next section.
As mentioned earlier, for us to make use of the time series forecasting packages and their functions, we would need to convert the tibble dataframe into a time series object.
Before we create the time series object, let us first aggregate the daily rainfall readings to monthly rainfall readings by (1) creating the year-month column for each observation using floor_date() and specifying it to derive the year and month of each observation, and (2) aggregate the temperature readings by station and year_month then use summarise() to compute the monthly rainfall reading.
Rows: 329,098
Columns: 4
$ tdate <date> 1980-01-01, 1980-01-02, 1980-01-03, 1980-01-04, …
$ station <chr> "Macritchie Reservoir", "Macritchie Reservoir", "…
$ daily_rainfall_total <dbl> 0.0, 0.0, 0.0, 0.0, 22.6, 49.6, 2.4, 0.0, 0.0, 0.…
$ year_month <date> 1980-01-01, 1980-01-01, 1980-01-01, 1980-01-01, …
Rows: 10,813
Columns: 3
Groups: station [37]
$ station <chr> "Admiralty", "Admiralty", "Admiralty", "Admiralty", "Admira…
$ year_month <date> 2009-01-01, 2009-02-01, 2009-03-01, 2009-04-01, 2009-05-01…
$ total_rf <dbl> NA, 148.0, NA, 148.8, 205.6, 92.0, 103.0, 90.2, 67.6, 160.0…
With the monthly temperature of all weather stations, let us filter out one weather station (e.g. Admiralty) to create a tibble data frame adm_rf so that we can convert it into an xts object, which is a type of time series object.
station year_month total_rf
Length:179 Min. :2009-01-01 Min. : 15.8
Class :character 1st Qu.:2012-09-16 1st Qu.:124.4
Mode :character Median :2016-07-01 Median :189.0
Mean :2016-06-19 Mean :204.9
3rd Qu.:2020-03-16 3rd Qu.:270.4
Max. :2023-12-01 Max. :517.4
NA's :18
We will use xts() from xts package to create a time series object. The order.by parameter uses the dates from the adm_rf dataframe. We then use the ts_regular() function to give the time series object adm_rf_xts a regular interval by adding NA values for missing dates.
Just in case there are missing months which we did not detected, we use the na.fill() function fills in those missing dates by extending values from previous days.
Let us plot out the monthly rainfall of Admiralty weather station using ggplotly.
p3 <- ggplot(adm_rf_xts, aes(x = Index, y = value)) +
geom_line() + theme_clean() +
labs(title = "Monthly Rainfall of Admiralty Weather Station", caption = "Data from Weather.gov.sg") +
xlab("Month-Year") +
ylab("Rainfall (in mm)") +
theme_ipsum_rc()
ggplotly(p3)From the above output, we see that there are missing temperatures for numerous time periods. As a result, the line for the above chart is not continuous.
Let us investigate futher using imputeTS package’s ggplot_na_distribution, which highlights the missing values in our data.
There are several ways to impute missing data in time series objects.
This na_ma()function also allows us to use linear-weighted and exponentially-weighted moving averages.
We can also use Kalman Smoothing on ARIMA model to impute the missing values.
admrf_imp_kalman <- na_kalman(adm_rf_xts, model = "auto.arima")
#plot chart
ggplot(admrf_imp_kalman, aes(x = Index, y = value)) +
geom_line()
Kalman Smoothing also has a “StrucTS” option. Let us try and see how it works for our monthly rainfall data.